FULLY EXPLICIT LARGE DEVIATION INEQUALITIES FOR EMPIRICAL PROCESSES WITH APPLICATIONS TO INFORMATION-BASED COMPLEXITY

CHRISTOPH AISTLEITNER 


Abstract. In the present paper we obtain fully explicit large deviation inequalities for empirical
processes indexed by a Vapnik-Chervonenkis class of sets (or functions). Furthermore we
illustrate the importance of such results for the theory of information-based complexity. 


1. Introduction and statement of results 


Let $\xi_1, \xi_2, \dots$ be independent random variables taking values in a measurable space $(X, \mathcal{X})$. The $n$-th empirical measure $P_n$ is defined by

$$P_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{\xi_i},$$

where $\delta_x$ denotes the Dirac measure centered at $x$. Furthermore, the $n$-th empirical process $\alpha_n$ is defined by

$$\alpha_n = \sqrt{n} \left( P_n - \frac{1}{n} \sum_{i=1}^{n} P^{(i)} \right),$$

where $P^{(i)}$ is the distribution of $\xi_i$. In the present paper we will only be concerned with the case when all $\xi_i$'s have the same distribution $P$, in which case the definition of the empirical process reduces to

$$\alpha_n := \sqrt{n} \, (P_n - P).$$


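Let $\mathcal{C}$ denote a class of elements of $\mathcal{X}$ (that is, a class of measurable subsets of $X$); then for an element $C$ of $\mathcal{C}$ we have

$$\alpha_n(C) = n^{-1/2} \sum_{i=1}^{n} \bigl( \mathbf{1}_C(\xi_i) - P(C) \bigr), \tag{1}$$

where $\mathbf{1}_C$ denotes the indicator function of $C$. For a measure $Q$ on $(X, \mathcal{X})$, we set $Q(f) = \int f \, dQ$. Let $\mathcal{F}$ be a class of measurable functions on $(X, \mathcal{X})$. Then for $f \in \mathcal{F}$ we have $P(f) = \mathbb{E}(f(\xi_i))$, and thus we are led to the definition

$$\alpha_n(f) = n^{-1/2} \sum_{i=1}^{n} \bigl( f(\xi_i) - \mathbb{E}(f(\xi_i)) \bigr), \qquad f \in \mathcal{F},$$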

as a straightforward generalization of (1). To understand the convergence properties of $\alpha_n$ it is natural to investigate the quantities

$$\sup_{C \in \mathcal{C}} |\alpha_n(C)| \qquad \text{and} \qquad \sup_{f \in \mathcal{F}} |\alpha_n(f)|, \tag{2}$$

respectively. In the simple case of $\xi_1, \xi_2, \dots$ being “ordinary” real random variables and $\mathcal{C}$ being the class of subintervals of $\mathbb{R}$, a result of this type is the well-known Glivenko-Cantelli theorem. In the general case it turns out that the asymptotic order of the quantities in (2) depends on the complexity (in an appropriate sense) of the classes $\mathcal{C}$ and $\mathcal{F}$, respectively. For a general introduction to this theory, see the book of van der Vaart and Wellner [17].
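
As a concrete illustration of the first quantity in (2), the following minimal sketch treats the simplest possible setting: real samples, $P$ the uniform distribution on $[0,1]$, and the class of intervals $[0,t]$. The function name `sup_alpha_halflines` and the choice of the uniform distribution are assumptions made for this example only; they are not taken from the paper.

```python
import math
import random


def sup_alpha_halflines(sample):
    """Return sup_t |alpha_n([0, t])| for a sample, assuming P is uniform on [0, 1].

    For this class, alpha_n([0, t]) = sqrt(n) * (F_n(t) - t), where F_n is the
    empirical distribution function, so the supremum is attained (or approached)
    at the sample points.
    """
    n = len(sample)
    xs = sorted(sample)
    sup_dev = 0.0
    for i, x in enumerate(xs, start=1):
        # Just after x the empirical mass is i/n, just before x it is (i-1)/n.
        sup_dev = max(sup_dev, i / n - x, x - (i - 1) / n)
    return math.sqrt(n) * sup_dev


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (100, 10_000):
        sample = [rng.random() for _ in range(n)]
        print(n, sup_alpha_halflines(sample))
```

In this setting the unnormalized deviation $\sup_t |F_n(t) - t|$ tends to zero by the Glivenko-Cantelli theorem, while the normalized quantity computed above typically stays of order one as $n$ grows.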

It turns out that a particular instance in which this problem can be efficiently handled is the case 
of $\mathcal{C}$ being a Vapnik-Chervonenkis class (VC class). Let $A$ be a finite subset of $X$. Then $\mathcal{C}$ is said
to shatter $A$ if for every subset $B$ of $A$ there exists a $C \in \mathcal{C}$ such that $B = A \cap C$. If there exists a
largest finite number $d$ such that $\mathcal{C}$ shatters at least one set of cardinality $d$, then $\mathcal{C}$ is called a VC
class of index $d$. To avoid measurability problems we will assume throughout this paper that $\mathcal{C}$ is
countable. The notion of VC classes is well-established in the theory of empirical processes, since 
there exist strong bounds for the entropy of such classes. More precisely, there is a large number 
of concentration inequalities or large deviation inequalities for empirical processes indexed by a 
class of sets (or functions), provided that this class of sets (or functions) satisfies certain entropy 
conditions (see, for example, da Ha). On the other hand, there are many results on the entropy 
(such as, e.g., results on $L^2$-covering numbers) of VC classes, including results of Pollard [15] and
Dudley [8] and culminating in the work of Haussler [10].
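
The definition of shattering given above can be tested by brute force on small examples. The sketch below is only an illustration: the helper names `shatters` and `vc_index`, and the example class of closed intervals with endpoints on a grid, are hypothetical choices and not part of the paper. Sets are represented by membership predicates, and the points are assumed to be distinct.

```python
from itertools import combinations


def shatters(sets, points):
    """Return True if the class `sets` shatters the finite set `points`.

    `sets` is an iterable of membership predicates (callables x -> bool).
    The class shatters `points` if every subset of `points` arises as the
    trace {p in points : C(p)} of some C in the class.
    """
    points = list(points)
    traces = {frozenset(p for p in points if member(p)) for member in sets}
    return len(traces) == 2 ** len(points)


def vc_index(sets, ground_set):
    """Largest d such that some d-element subset of `ground_set` is shattered."""
    sets = list(sets)
    best = 0
    for d in range(1, len(ground_set) + 1):
        if any(shatters(sets, A) for A in combinations(ground_set, d)):
            best = d
        else:
            break  # subsets of shattered sets are shattered, so we can stop
    return best


# Hypothetical example class: closed intervals [a, b] with endpoints on a grid.
grid = [k / 2 for k in range(0, 9)]  # 0.0, 0.5, ..., 4.0
intervals = [
    (lambda x, a=a, b=b: a <= x <= b) for a in grid for b in grid if a <= b
]
```

For this class, any two points can be separated in all four possible ways, but no interval contains the two outer points of a triple without the middle one, so the index restricted to a three-point ground set is 2.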


Large deviation inequalities for empirical processes are an important tool in information-based 
complexity, where they are used to establish the existence of low-discrepancy point sets (that 
is, point sets having an empirical distribution which is “close” to a desired distribution). This 
connection is described in more detail in Section 2. However, as far as I know, there exist no
reasonable large deviation inequalities for empirical processes indexed by a VC class of sets (or
functions) which are completely explicit, in the sense that they contain neither unspecified constants
nor quantities depending on the entropy of the class of test sets in a complicated (inexplicit)
way.^1 For applications in numerical analysis, however, any results containing unspecified quantities
are clearly of limited use, and accordingly the main point in writing this note was to provide 
completely explicit inequalities for this problem, with a view towards applications in information- 
based complexity (and with a reasonable size of the constants involved). 


Theorem 1 below is a large deviation inequality for the supremum of an empirical process indexed
by a VC class of sets. The subsequent Theorem 2 is a similar result for empirical processes indexed
by functions. 


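Theorem 1. Let $\xi_1, \dots, \xi_n$ be independent random variables with values in some measurable space $(X, \mathcal{X})$, and having common distribution $P$. Let $\mathcal{C}$ be a countable family of elements of $\mathcal{X}$. Assume that $\mathcal{C}$ is a VC class of index $d$. Then for

$$Z = \sup_{C \in \mathcal{C}} \left| \sum_{i=1}^{n} \bigl( \mathbf{1}_C(\xi_i) - P(C) \bigr) \right|$$

and for all positive real numbers $\eta$ and $x$ we have

$$\Pr \left( Z > 1015 (1+\eta) \, d \left( 1 + \sqrt{1 + \frac{2n}{1015 d}} \right) + 2 \sqrt{n x} + \bigl( 2.5 + 32 \eta^{-1} \bigr) x \right) \le 2 e^{-x}.$$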


From Theorem 1, by choosing $\eta = 1/100$ and assuming that $n \ge d$, we immediately obtain the following corollary, which is even easier to apply.

Corollary 1. If in Theorem 1 we additionally assume that $n \ge d$, then

$$\Pr \left( Z > 2052 \sqrt{d n} + 2 \sqrt{n x} + 3203 x \right) \le 2 e^{-x}$$

for every $x > 0$.
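
The step from Theorem 1 to Corollary 1 can be made explicit. For $n \ge d$ we have $d \le \sqrt{dn}$ and $d \sqrt{1 + 2n/(1015 d)} = \sqrt{d^2 + 2nd/1015} \le \sqrt{dn} \sqrt{1 + 2/1015}$, so with $\eta = 1/100$ the first term of Theorem 1 is at most $1015 \cdot 1.01 \cdot (1 + \sqrt{1 + 2/1015}) \sqrt{dn} \approx 2051.31 \sqrt{dn}$, while $2.5 + 32 \cdot 100 = 3202.5 \le 3203$. The following sketch checks this numerically; the function names are hypothetical, and the formulas are the thresholds as read from the two statements above.

```python
import math


def theorem1_threshold(n, d, x, eta):
    """Level inside Pr(Z > ...) in Theorem 1."""
    main = 1015 * (1 + eta) * d * (1 + math.sqrt(1 + 2 * n / (1015 * d)))
    return main + 2 * math.sqrt(n * x) + (2.5 + 32 / eta) * x


def corollary1_threshold(n, d, x):
    """Level inside Pr(Z > ...) in Corollary 1 (stated for n >= d)."""
    return 2052 * math.sqrt(d * n) + 2 * math.sqrt(n * x) + 3203 * x


def corollary_dominates(n, d, x):
    """True if the corollary's level is at least the theorem's level for eta = 1/100."""
    return theorem1_threshold(n, d, x, eta=0.01) <= corollary1_threshold(n, d, x)
```

The assumption $n \ge d$ is genuinely needed for this comparison: for $n$ much smaller than $d$ the term $1015 (1+\eta) d$ alone exceeds $2052 \sqrt{dn}$.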


^1 The only exception which I know of is Lemma 2 in [2], which is based on results of Alexander [4]; however, the
involved constants there are ridiculously large, and are of no practical use in the applications in information-based 
complexity which we have in mind. 

For the statement of Theorem 2 we need the notion of a VC subgraph class. Let $f$ be a function which maps $X$ to $\mathbb{R}$. Then the subgraph $C_f$ of $f$ is the set

$$\{ (x, t) \in X \times \mathbb{R} : t < f(x) \}.$$

For a given family of functions $\mathcal{F}$, let $\mathcal{C}$ denote the class of all sets $C_f$, $f \in \mathcal{F}$. If $\mathcal{C}$ is a VC class, then $\mathcal{F}$ is called a VC subgraph class, and by definition the VC index of $\mathcal{F}$ is the VC index of $\mathcal{C}$. With this notation, we can formulate the following theorem.


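Theorem 2. Let $\xi_1, \dots, \xi_n$ be independent random variables with values in some measurable space $(X, \mathcal{X})$, and having common distribution $P$. Let $\mathcal{F}$ be a countable family of real-valued, measurable, $P$-centered functions on $(X, \mathcal{X})$ such that $\|f\|_\infty \le 1$ for all $f \in \mathcal{F}$. Assume furthermore that $\mathcal{F}$ is a VC subgraph class of index $d$. Let

$$\sigma^2 = \sup_{f \in \mathcal{F}} \mathbb{E}(f^2(\xi_1)).$$

Then for all positive real numbers $\eta$ and $x$, and for

$$Z = \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^{n} f(\xi_i) \right|$$

we have

$$\Pr \left( Z > (1+\eta) \left( 60 \sqrt{d} \sqrt{n \sigma^2 \log(55/\sigma)} + 14400 \, d \log(55/\sigma) \right) + \sigma \sqrt{8 n x} + \kappa(\eta) x \right) \le e^{-x},$$

where $\kappa(\eta) = 2.5 + 32 \eta^{-1}$. If we are willing to choose $\sigma = 1$, then we obtain

$$\Pr \left( Z > (1+\eta) \left( 121 \sqrt{d} \sqrt{n} + 58000 \, d \right) + \sqrt{8 n x} + \kappa(\eta) x \right) \le e^{-x},$$

where $\kappa(\eta)$ is as above.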


2. Applications to information-based complexity: existence of low-discrepancy point sets


The theory of information-based complexity is concerned with the amount of information on an 
object (e.g. a function) which is necessary to approximate a quantity depending on this object up 
to a given precision. More specifically, in a typical instance the “amount of information” means 
the number of function evaluations, and the problem is to approximate the integral of a d-variate 
function $f$ (which may be assumed to come from a certain class of functions, e.g. satisfying certain smoothness assumptions). A standard method for approximating such integrals is the Quasi-Monte Carlo (QMC) method, which is based on the fact that for points $x_1, \dots, x_n \in [0,1]^d$ and for a $d$-variate function $f$ having bounded variation $\operatorname{Var}_{HK} f$ on $[0,1]^d$ we have

$$\left| \int_{[0,1]^d} f(x) \, dx - \frac{1}{n} \sum_{i=1}^{n} f(x_i) \right| \le (\operatorname{Var}_{HK} f) \cdot D_n^*(x_1, \dots, x_n). \tag{3}$$

Here $D_n^*(x_1, \dots, x_n)$ is the star-discrepancy of the point set $x_1, \dots, x_n$, which is defined as

$$D_n^*(x_1, \dots, x_n) = \sup_{A \in \mathcal{A}} \left| \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_A(x_i) - \operatorname{vol}(A) \right|, \tag{4}$$

where $\mathcal{A}$ denotes the class of all axis-parallel boxes in $[0,1]^d$ which have one vertex at the origin. Compare the definition of the discrepancy to (1); however, note also that $x_1, \dots, x_n$ are assumed to be deterministic points. The inequality (3), which is called the Koksma-Hlawka inequality, essentially says that point sets having small discrepancy lead to small errors in QMC integration. Accordingly, the question asking for the necessary number of function evaluations in order to approximate the integral of $f$ can be reduced to the question asking for the existence of low-discrepancy point sets. This theory is quite well-understood in the case of moderate $d$ and “large” $n$, where low-discrepancy point sets can be constructed using certain “nets” (see, for example, [6]). However, the state of our knowledge is very unsatisfactory in the case when $d$ is large and $n$ is assumed to be moderate in comparison with $d$, a problem which is the source of much ongoing research (for more information on this problem see the three volumes on Tractability of Multivariate Problems by Novak and Wozniakowski, in particular volume II [14] which is devoted to linear functionals such as integration).
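
To make (3) and (4) tangible, here is a minimal sketch in dimension $d = 1$, where the boxes in $\mathcal{A}$ are the intervals $[0,t)$ and the variation in (3) is the ordinary total variation. The integrand $f(x) = x^2$, the midpoint nodes and all function names are assumptions chosen for illustration; they are not taken from the paper.

```python
def star_discrepancy_1d(points):
    """Star-discrepancy of points in [0, 1] (dimension d = 1)."""
    n = len(points)
    xs = sorted(points)
    # The supremum over intervals [0, t) is approached at the sample points.
    return max(max(i / n - x, x - (i - 1) / n) for i, x in enumerate(xs, start=1))


def qmc_estimate(f, points):
    """Equal-weight quadrature (1/n) * sum f(x_i)."""
    return sum(f(x) for x in points) / len(points)


# Illustrative choices: f(x) = x^2 on [0, 1] is monotone with total
# variation 1 and exact integral 1/3; the nodes are the midpoints (2i-1)/(2n).
n = 4
points = [(2 * i - 1) / (2 * n) for i in range(1, n + 1)]
error = abs(1 / 3 - qmc_estimate(lambda x: x * x, points))
bound = 1.0 * star_discrepancy_1d(points)  # Var(f) * D_n^*
```

For these nodes the star-discrepancy equals $1/(2n)$, and the integration error is indeed below the right-hand side of (3).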

Let $n^*(\varepsilon, d)$ denote the inverse of the star-discrepancy, that is, the smallest possible cardinality of a point set in $[0,1]^d$ having star-discrepancy at most $\varepsilon$, for given $\varepsilon \in (0,1)$. A fundamental theorem of Heinrich, Novak, Wasilkowski and Wozniakowski [11] asserts that

$$n^*(\varepsilon, d) \le c_{\mathrm{abs}} \, d \, \varepsilon^{-2}.$$

Surprisingly, this result follows directly from a combination of a large deviation inequality of Talagrand [16] with the entropy bounds for VC classes of Haussler [10]. Similar results for general discrepancies can be obtained in exactly the same way; however, they all contain the unspecified constant $c_{\mathrm{abs}}$, which is inherited from Talagrand’s paper [16].^2 The main point for writing the present note was to provide a numerically fully explicit version of such “inverse of the discrepancy” results, by avoiding this particular result of Talagrand and using instead a different, fully explicit large deviation inequality together with the explicit entropy bounds of Haussler.


To define a general discrepancy, we assume that $(X, \mathcal{X})$ is a measurable space and that $P$ is a probability measure on this space. Let $\mathcal{C}$ denote a family of elements of $\mathcal{X}$, and assume that $x_1, \dots, x_n$ are $n$ elements of $X$. Then the discrepancy $D_n^{(\mathcal{C},P)}$ of the points $x_1, \dots, x_n$ is defined as

$$D_n^{(\mathcal{C},P)}(x_1, \dots, x_n) = \sup_{C \in \mathcal{C}} \left| \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_C(x_i) - P(C) \right|.$$

As a consequence of Theorem 1 we will obtain the following result.

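Theorem 3. Let $(X, \mathcal{X}, P)$ be a probability space and let $\mathcal{C}$ be a countable family of elements of $\mathcal{X}$. Assume that $\mathcal{C}$ is a VC class of index $d$. Then for every given $\varepsilon \in (0,1)$ there exist a positive integer $n$ together with points $x_1, \dots, x_n$ such that

$$n \le 2576 \, d \, (1+\varepsilon) \, \varepsilon^{-2} \tag{5}$$

and

$$D_n^{(\mathcal{C},P)}(x_1, \dots, x_n) \le \varepsilon.$$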


Remark 1: This result, together with a variant of Koksma’s inequality, proves that QMC integration is feasible even in the high-dimensional setting, and with respect to general measures. Such a general version of discrepancy theory and QMC integration is described in [2] and [3]. Note that discrepancies are usually defined with respect to an uncountable class of test sets, such as the class $\mathcal{A}$ in definition (4). However, often it is possible to replace this uncountable class by a countable class without changing the value of the discrepancy (in the case of (4), one may replace the class $\mathcal{A}$ by the subclass of those boxes whose upper-right corner has only rational coordinates).


Remark 2: While Theorem 3 needs to be translated into a bound for the maximal error in numerical
integration by means of a Koksma-Hlawka inequality, Theorem 2 directly implies the existence
of well-suited points for QMC integration for a VC subgraph class $\mathcal{F}$ of functions. However, for a
direct application of Theorem 2 it would be necessary to calculate the entropy of an “interesting”
class of functions whose elements one wishes to integrate numerically. As far as I know, not much 
is known in this respect, but this seems to be a good starting point for further research. 


^2 It seems that in principle it should be possible to obtain a completely explicit version of the main result
of [16] by repeating Talagrand’s arguments and keeping track of all the constants involved; however, it is clear
that this would require a huge amount of effort. Talagrand himself writes, somewhere near the end of his 50-page 
paper: “Figuring out the best possible dependence [of the constants] given by this approach requires checking many 
computational details and more energy than the author has left at this point. ” 

Remark 3: In [2] we obtained Theorem 3 with the upper bound in equation (5) replaced by $2^{26} d \varepsilon^{-2}$. In other words, the result in Theorem 3 improves the previously known estimate by a factor of some tens of thousands, and brings the required number of points within reach of mathematical software. However, in the special case of the star-discrepancy as defined in (4), our new general upper bound is (not surprisingly) weaker than the best bound currently known, which gives $100 d \varepsilon^{-2}$ instead of the upper bound provided by (5); see [1]. In this context, see also [7] and [12].
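
The three bounds compared in Remark 3 are easy to tabulate. The sketch below simply evaluates the expressions $2576 d (1+\varepsilon) \varepsilon^{-2}$, $2^{26} d \varepsilon^{-2}$ and $100 d \varepsilon^{-2}$ as quoted in the remark; the function names are hypothetical.

```python
def bound_theorem3(d, eps):
    """Upper bound (5) on the number of points: 2576 d (1 + eps) / eps^2."""
    return 2576 * d * (1 + eps) / eps**2


def bound_previous(d, eps):
    """Earlier general bound quoted in Remark 3: 2^26 d / eps^2."""
    return 2**26 * d / eps**2


def bound_star(d, eps):
    """Bound for the star-discrepancy quoted in Remark 3: 100 d / eps^2."""
    return 100 * d / eps**2


def improvement_factor(eps):
    """Ratio of the earlier bound to (5); it does not depend on d."""
    return bound_previous(1, eps) / bound_theorem3(1, eps)
```

Since $2^{26}/2576 \approx 26052$, the ratio equals roughly $26052/(1+\varepsilon)$ and therefore lies between about $13000$ and $26000$ for $\varepsilon \in (0,1)$.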

Throughout the remaining part of this paper, the following setting will be fixed. $(X, \mathcal{X})$ denotes a measurable space, $P$ denotes a probability measure on this space, and $\xi_1, \xi_2, \xi_3, \dots$ are independent, identically distributed (i.i.d.) random variables having distribution $P$. Furthermore, $\mathcal{C}$ denotes a countable class of elements of $\mathcal{X}$, and $\mathcal{F}$ denotes a countable family of real-valued measurable functions on $(X, \mathcal{X})$.

3. Massart’s concentration inequality for empirical processes

The key ingredient in our proof is the following result of Massart [13]. The main aim of Massart’s paper was to provide an explicit version of an earlier result of Talagrand, as can be seen from its title, About the constants in Talagrand’s concentration inequalities for empirical processes. Massart’s result is explicit in the sense that it eliminates all unspecified constants from Talagrand’s result; however, it still contains a certain quantity which depends crucially on the entropy of the class of test functions. We state Massart’s result below, and continue our discussion subsequently.

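Lemma 1. Assume that $\|f\|_\infty \le b < \infty$ for all $f \in \mathcal{F}$. Let $Z$ denote either

$$\sup_{f \in \mathcal{F}} \left| \sum_{i=1}^{n} f(\xi_i) \right| \qquad \text{or} \qquad \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^{n} \bigl( f(\xi_i) - \mathbb{E}(f(\xi_i)) \bigr) \right|.$$

Let $\sigma^2 = \sup_{f \in \mathcal{F}} \operatorname{Var}(f(\xi_1))$. Then for all positive real numbers $\eta$ and $x$

$$\mathbb{P} \left( Z > (1+\eta) \mathbb{E}(Z) + \sigma \sqrt{2 \kappa n x} + \kappa(\eta) b x \right) \le e^{-x},$$

where $\kappa$ and $\kappa(\eta)$ can be taken equal to $\kappa = 4$ and $\kappa(\eta) = 2.5 + 32 \eta^{-1}$. Moreover, one also has

$$\mathbb{P} \left( Z < (1-\eta) \mathbb{E}(Z) - \sigma \sqrt{2 \kappa' n x} - \kappa'(\eta) b x \right) \le e^{-x},$$

where $\kappa' = 5.4$ and $\kappa'(\eta) = 2.5 + 43.2 \eta^{-1}$.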

In view of Massart’s result, the main remaining part in order to obtain our desired results is to 
find good upper bounds for E (Z). Such bounds will be given in the next section. 
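
The upper tail bound of Lemma 1 translates directly into a level $t$ with $\Pr(Z > t) \le e^{-x}$, once an upper bound for $\mathbb{E}(Z)$ is available. The sketch below evaluates this level; the function names are hypothetical. The helper `best_eta`, which minimizes $(1+\eta) \mathbb{E}(Z) + \kappa(\eta) b x$ over $\eta$ by elementary calculus, is an illustration added here and is not used in the paper, which works with fixed values of $\eta$.

```python
import math

KAPPA = 4.0  # constant kappa from Lemma 1


def kappa_eta(eta):
    """kappa(eta) = 2.5 + 32 / eta from Lemma 1."""
    return 2.5 + 32.0 / eta


def massart_upper_threshold(mean_z, sigma, b, n, x, eta):
    """Level t for which Lemma 1 gives Pr(Z > t) <= exp(-x).

    `mean_z` may be E(Z) or any upper bound for it, since the level is
    increasing in this argument.
    """
    return (1 + eta) * mean_z + sigma * math.sqrt(2 * KAPPA * n * x) + kappa_eta(eta) * b * x


def best_eta(mean_z, b, x):
    """Minimizer of eta * mean_z + 32 * b * x / eta over eta > 0."""
    return math.sqrt(32.0 * b * x / mean_z)
```

The trade-off is visible in the formula: a small $\eta$ keeps the factor in front of $\mathbb{E}(Z)$ close to one but inflates the term that is linear in $x$.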


4. The expected value of the supremum of an empirical process 


The following lemma will be deduced from Lemma 13.5 of [5], together with Haussler’s entropy 
bounds for VC classes. Together with Lemma 1 it allows us to obtain a fully explicit large deviation
inequality for empirical processes indexed by sets. The case of empirical processes indexed by 
functions is treated subsequently. 


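Lemma 2. Assume that $\mathcal{C}$ is a VC class with index $d$, and assume that $P(C) \le 1/2$ for all $C \in \mathcal{C}$. Set

$$Z = \sup_{C \in \mathcal{C}} \left| \sum_{i=1}^{n} \bigl( \mathbf{1}_C(\xi_i) - P(C) \bigr) \right|.$$

Then

$$\mathbb{E}(Z) \le 1015 \, d \left( 1 + \sqrt{1 + \frac{2n}{1015 d}} \right).$$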

Proof of Lemma 2. Let

$$Z^+ = \sup_{C \in \mathcal{C}} \sum_{i=1}^{n} \bigl( \mathbf{1}_C(\xi_i) - P(C) \bigr).$$

Adopting the notation from [5, Lemma 13.5] we choose $\sigma^2 = 1/2$ and set

$$D = 6 \sum_{j=0}^{\infty} 2^{-j} \sqrt{H(2^{-j-3/2}, \mathcal{C})}.$$

Here $H(\delta, \mathcal{C})$ is the universal $\delta$-metric entropy (also called Koltchinskii-Pollard entropy) of $\mathcal{C}$, which by [5, Lemma 13.6] is bounded by

$$H(\delta, \mathcal{C}) \le 2 d \log(e/\delta) + \log(e(d+1)), \qquad \delta > 0,$$

since by assumption $\mathcal{C}$ is a VC class with index $d$. (Note that [5, Lemma 13.6] is just a reformulation of Haussler’s entropy bounds, that is, of [10, Corollary 1].) Thus we have

$$D \le 6 \sum_{j=0}^{\infty} 2^{-j} \sqrt{2 d \log\bigl(e \, 2^{j+3/2}\bigr) + \log(e(d+1))},$$

and some standard calculations show that

$$D \le 31.858 \sqrt{d} \qquad \text{for } d = 1, 2, \dots. \tag{6}$$

The last inequality on page 372 of [5] states that

$$\mathbb{E}(Z^+) \le \frac{D^2}{2} \left( 1 + \sqrt{1 + \frac{4 \sigma^2 n}{D^2}} \right)$$

(note that in [5] an additional factor $\sqrt{n}$ appears, which comes from the normalization in their definition of $Z^+$). Together with (6) and the fact that $(31.858)^2 \le 1015$ this implies that

$$\mathbb{E}(Z^+) \le \frac{1015 \, d}{2} \left( 1 + \sqrt{1 + \frac{2n}{1015 d}} \right). \tag{7}$$

As noted on page 373 of [5], the same bound as in (7) holds for $\mathbb{E}(Z^-)$, where

$$Z^- = \sup_{C \in \mathcal{C}} \sum_{i=1}^{n} \bigl( P(C) - \mathbf{1}_C(\xi_i) \bigr).$$

Thus overall we have

$$\mathbb{E} \left( \sup_{C \in \mathcal{C}} \left| \sum_{i=1}^{n} \bigl( \mathbf{1}_C(\xi_i) - P(C) \bigr) \right| \right) \le \mathbb{E}(Z^+) + \mathbb{E}(Z^-) \le 1015 \, d \left( 1 + \sqrt{1 + \frac{2n}{1015 d}} \right).$$

This proves the lemma. □
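
The “standard calculations” behind (6) can be checked numerically. The sketch below evaluates the series $6 \sum_{j \ge 0} 2^{-j} \sqrt{2d \log(e 2^{j+3/2}) + \log(e(d+1))}$ divided by $\sqrt{d}$, truncated after a fixed number of terms. Each summand divided by $\sqrt{d}$ equals $2^{-j} \sqrt{2 \log(e 2^{j+3/2}) + \log(e(d+1))/d}$, which is decreasing in $d$, so $d = 1$ is the extreme case. The function name and the truncation length are assumptions of this illustration.

```python
import math


def d_over_sqrt_d(d, terms=200):
    """Evaluate 6 * sum_j 2^-j * sqrt(2 d log(e 2^(j+3/2)) + log(e (d+1))) / sqrt(d).

    The series is truncated after `terms` summands; for terms = 200 the
    neglected tail is far below floating-point resolution.
    """
    total = 0.0
    for j in range(terms):
        # log(e 2^(j+3/2)) = 1 + (j + 3/2) log 2 and log(e (d+1)) = 1 + log(d+1)
        entropy_bound = 2 * d * (1 + (j + 1.5) * math.log(2)) + 1 + math.log(d + 1)
        total += 2.0 ** (-j) * math.sqrt(entropy_bound)
    return 6 * total / math.sqrt(d)


worst = max(d_over_sqrt_d(d) for d in range(1, 201))
```

The value for $d = 1$ is approximately $31.857$, so the constant $31.858$ in (6) is essentially the best one this series allows.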


For future reference we also state the following lemma, which is a variant of Lemma 2 without the assumption that $P(C) \le 1/2$ for all $C \in \mathcal{C}$. The proof of Lemma 3 can be given in the same way as the proof of Lemma 2.

Lemma 3. If we drop the assumption that $P(C) \le 1/2$ for all $C \in \mathcal{C}$ in the statement of Lemma 2, we obtain

$$\mathbb{E}(Z) \le 914 \, d \left( 1 + \sqrt{1 + \frac{4n}{914 d}} \right).$$

The following lemma provides an upper bound for the quantity $\mathbb{E}(Z)$ in the case of empirical processes
indexed by functions, subject to entropy bounds on the class of functions. The appropriate
entropy bounds are contained in Haussler’s paper and are stated in Lemma 5 and, in a form more
suitable for our application, in Lemma 6.

Lemma 4 ([9, Proposition 3]). Assume that the functions in $\mathcal{F}$ are $P$-centered and that their absolute values are uniformly bounded by a constant $U$. Assume also that for all finitely supported probability measures $Q$, the $L^2(Q)$-covering numbers satisfy

$$N(\delta, \mathcal{F}, L^2(Q)) \le \left( \frac{A U}{\delta} \right)^{v}, \qquad 0 < \delta < U,$$

for some $A \ge e$ and $v \ge 2$. Let $\sigma \le U$ be such that

$$\sigma^2 \ge \sup_{f \in \mathcal{F}} \mathbb{E}(f^2(\xi_1)).$$

Then for all $n \ge 1$, for

$$Z = \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^{n} f(\xi_i) \right|$$

we have

$$\mathbb{E}(Z) \le 30 \sqrt{2v} \sqrt{n \sigma^2 \log(5AU/\sigma)} + 15^2 \, 2^5 \, v U \log(5AU/\sigma).$$


As noted, the required entropy bounds can be deduced from Haussler’s results. 


Lemma 5 ([10, Corollary 3]). Let $Q$ be a probability distribution on $(X, \mathcal{X})$. Assume that the functions in $\mathcal{F}$ have only values in $[0,1]$, and that $\mathcal{F}$ is a VC subgraph class with index $d$. Then for every $\delta > 0$ the $L^1$-packing number of $\mathcal{F}$ satisfies

$$M(\delta, \mathcal{F}, L^1(Q)) \le e(d+1) \left( \frac{2e}{\delta} \right)^{d}.$$


Lemma 6. Let $Q$ be a probability distribution on $(X, \mathcal{X})$. Assume that the functions in $\mathcal{F}$ are $P$-centered and that their absolute values are uniformly bounded by 1. Assume also that $\mathcal{F}$ is a VC subgraph class of index $d$. Then for every $\delta > 0$

$$N(\delta, \mathcal{F}, L^2(Q)) \le \left( \frac{11}{\delta} \right)^{2d}.$$

Proof of Lemma 6. There are three differences between the entropy bound in Lemma 5 and the desired bound in Lemma 6. First, the bound in Lemma 5 is stated for packing numbers, while we need a bound for covering numbers. Second, the bound in Lemma 5 is for the $L^1$-distance, while we need a bound for the $L^2$-distance. Third, the bound in Lemma 5 is formulated for a class of functions having values in $[0,1]$, while we need a bound for centered functions (which may also take negative values).


Let $\mathcal{F}$ be the class of functions from the assumption of the lemma, and let $\mathcal{G}$ be the class of all functions of the form $g(x) = (f(x) + 1)/2$, where $f \in \mathcal{F}$. Then $\mathcal{G}$ is a class of functions having values in $[0,1]$, and $\mathcal{G}$ also is a VC subgraph class of index $d$. It is quite easily seen that for every measure $Q$ we have

$$M(\delta, \mathcal{F}, L^1(Q)) \le M(\delta/2, \mathcal{G}, L^1(Q))$$

for every $\delta > 0$. By a well-known relation between packing numbers and covering numbers (see for example [17, p. 98]) we have

$$N(\delta, \mathcal{F}, L^1(Q)) \le M(\delta/2, \mathcal{F}, L^1(Q))$$

for every $\delta > 0$. Furthermore we have

$$N(\delta, \mathcal{F}, L^2(Q)) \le N(\delta^2, \mathcal{F}, L^1(Q)), \qquad \delta > 0,$$

since all functions in $\mathcal{F}$ have values in $[-1,1]$; this follows directly from

$$\int |f - g|^2 \, dQ \le 2 \int |f - g| \, dQ.$$

Consequently from Lemma 5 we can deduce that overall we have

$$N(\delta, \mathcal{F}, L^2(Q)) \le M(\delta^2/4, \mathcal{G}, L^1(Q)) \le e(d+1) \left( \frac{8e}{\delta^2} \right)^{d} \le \left( \frac{11}{\delta} \right)^{2d},$$

which is valid for all $\delta \in (0,1)$ and $d \ge 1$. □
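
The last step of this proof reduces to an inequality that does not involve $\delta$. Assuming that the final bound reads $e(d+1)(8e/\delta^2)^d \le (11/\delta)^{2d}$ (the constant $11$ is inferred here from the value $55 = 5 \cdot 11$ appearing in Proposition 1), the powers of $\delta$ cancel and it remains to verify $e(d+1)(8e)^d \le 121^d$ for all $d \ge 1$. The following sketch, with a hypothetical function name, checks this on a logarithmic scale.

```python
import math


def last_step_holds(d, base=11):
    """Check e (d + 1) (8 e)^d <= base^(2 d), compared on a log scale to avoid overflow."""
    lhs = 1 + math.log(d + 1) + d * (math.log(8) + 1)
    rhs = 2 * d * math.log(base)
    return lhs <= rhs
```

For $d = 1$ the two sides are $16 e^2 \approx 118.2$ and $121$, so $11$ is the smallest integer base for which the inequality holds for every $d \ge 1$.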

From a combination of Lemma 4 and Lemma 6 we directly obtain the following proposition.

Proposition 1. Assume that the functions in $\mathcal{F}$ are $P$-centered and that their absolute values are uniformly bounded by 1. Assume also that $\mathcal{F}$ is a VC subgraph class of index $d$. Let $\sigma$ be such that

$$\sigma^2 \ge \sup_{f \in \mathcal{F}} \mathbb{E}(f^2(\xi_1)).$$

Then for all $n \ge 1$, for

$$Z = \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^{n} f(\xi_i) \right|$$

we have

$$\mathbb{E}(Z) \le 60 \sqrt{d} \sqrt{n \sigma^2 \log(55/\sigma)} + 15^2 \, 2^6 \, d \log(55/\sigma).$$

If we are willing to choose $\sigma = 1$, then we have

$$\mathbb{E}(Z) \le 121 \sqrt{d n} + 58000 \, d.$$
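
The constants in Proposition 1 can be traced back to Lemma 4. Assuming that Lemma 6 supplies the parameters $A = 11$, $U = 1$ and $v = 2d$ (this reading is inferred from the constants $60 = 30 \sqrt{2 \cdot 2}$ and $55 = 5 \cdot 11$), the two right-hand sides coincide, and for $\sigma = 1$ the simplified constants follow from $60 \sqrt{\log 55} \approx 120.1$ and $15^2 \cdot 2^6 \cdot \log 55 \approx 57706$. The sketch below checks this; all function names are hypothetical.

```python
import math


def lemma4_bound(n, sigma, A, U, v):
    """Right-hand side of Lemma 4."""
    log_term = math.log(5 * A * U / sigma)
    return (30 * math.sqrt(2 * v) * math.sqrt(n * sigma**2 * log_term)
            + 15**2 * 2**5 * v * U * log_term)


def proposition1_bound(n, sigma, d):
    """Right-hand side of Proposition 1 for general sigma."""
    log_term = math.log(55 / sigma)
    return (60 * math.sqrt(d) * math.sqrt(n * sigma**2 * log_term)
            + 15**2 * 2**6 * d * log_term)


def proposition1_bound_sigma_one(n, d):
    """Simplified right-hand side of Proposition 1 for sigma = 1."""
    return 121 * math.sqrt(d * n) + 58000 * d
```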

5. Proof of Theorems 1, 2 and 3

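Proof of Theorem 1. We split the class $\mathcal{C}$ into two subclasses $\mathcal{C}_1$ and $\mathcal{C}_2$ such that

$$P(C) \le \tfrac{1}{2} \text{ for all } C \in \mathcal{C}_1, \qquad \text{and} \qquad P(C) > \tfrac{1}{2} \text{ for all } C \in \mathcal{C}_2.$$

Now we define $\mathcal{C}_3$ as the class of all the sets $X \setminus C$, $C \in \mathcal{C}_2$. Then trivially $\mathcal{C}_1$ has VC index at most $d$, and it is easily seen that $\mathcal{C}_3$ also has VC index at most $d$ (note that the VC index does not change when replacing a class of sets by the class of all complements of these sets). Furthermore, we also have

$$P(C) < \tfrac{1}{2} \text{ for all } C \in \mathcal{C}_3.$$

Consequently we may apply Lemma 2 to the classes $\mathcal{C}_1$ and $\mathcal{C}_3$. Note that we have

$$Z = \sup_{C \in \mathcal{C}} \left| \sum_{i=1}^{n} \bigl( \mathbf{1}_C(\xi_i) - P(C) \bigr) \right| \le \max \left( \sup_{C \in \mathcal{C}_1} \left| \sum_{i=1}^{n} \bigl( \mathbf{1}_C(\xi_i) - P(C) \bigr) \right|, \; \sup_{C \in \mathcal{C}_3} \left| \sum_{i=1}^{n} \bigl( \mathbf{1}_C(\xi_i) - P(C) \bigr) \right| \right). \tag{8}$$

Combining the result from Lemma 2 with (8) and Lemma 1 we obtain that for all positive real numbers $\eta$ and $x$ we have

$$\Pr \left( Z > (1+\eta) \, 1015 \, d \left( 1 + \sqrt{1 + \frac{2n}{1015 d}} \right) + \sqrt{\kappa n x} + \kappa(\eta) x \right) \le 2 e^{-x},$$

where $\kappa$ and $\kappa(\eta)$ can be taken equal to $\kappa = 4$ and $\kappa(\eta) = 2.5 + 32 \eta^{-1}$ (the factor 2 on the right-hand side comes from applying Lemma 1 separately to each of the two classes $\mathcal{C}_1$ and $\mathcal{C}_3$). This proves Theorem 1. □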

Proof of Theorem 2. Under the assumptions of Theorem 2, by combining Lemma 1 and Proposition 1 we obtain

$$\Pr \left( Z > (1+\eta) \left( 60 \sqrt{d} \sqrt{n \sigma^2 \log(55/\sigma)} + 15^2 \, 2^6 \, d \log(55/\sigma) \right) + \sigma \sqrt{8 n x} + \bigl( 2.5 + 32 \eta^{-1} \bigr) x \right) \le e^{-x}$$

for every $\eta > 0$ and $x > 0$. If we are willing to choose $\sigma = 1$, then we obtain

$$\Pr \left( Z > (1+\eta) \left( 121 \sqrt{d} \sqrt{n} + 58000 \, d \right) + \sqrt{8 n x} + \bigl( 2.5 + 32 \eta^{-1} \bigr) x \right) \le e^{-x}$$

for every $\eta > 0$ and $x > 0$. □


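Proof of Theorem 3. Let

$$n = \left\lceil 2575.5 \, d \, (1+\varepsilon) \, \varepsilon^{-2} \right\rceil,$$

and apply Theorem 1 with the specific choice of $\eta = 1/14$ and $x = (\log 2) + 10^{-10}$. Then the right-hand side in the conclusion of Theorem 1 is smaller than 1, which means that there exists a realization $x_1, \dots, x_n$ of the random variables $\xi_1, \dots, \xi_n$ such that

$$D_n^{(\mathcal{C},P)}(x_1, \dots, x_n) = \frac{1}{n} \sup_{C \in \mathcal{C}} \left| \sum_{i=1}^{n} \bigl( \mathbf{1}_C(x_i) - P(C) \bigr) \right| \le \frac{1}{n} \left( 1015 (1+\eta) \, d \left( 1 + \sqrt{1 + \frac{2n}{1015 d}} \right) + 2 \sqrt{n x} + \bigl( 2.5 + 32 \eta^{-1} \bigr) x \right). \tag{9}$$

Now it is a matter of computational routine to check that with our choice of $n$, $\eta$ and $x$ the term in (9) is actually dominated by $\varepsilon$. Furthermore, one can easily check that

$$\left\lceil 2575.5 \, d \, (1+\varepsilon) \, \varepsilon^{-2} \right\rceil \le 2576 \, d \, (1+\varepsilon) \, \varepsilon^{-2}$$

for all $\varepsilon \in (0,1)$ and $d = 1, 2, \dots$, which concludes the proof of the theorem. □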
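
The “computational routine” mentioned in this proof can be spot-checked numerically. The sketch below evaluates the threshold of Theorem 1 divided by $n$, with $n$, $\eta$ and $x$ chosen as in the proof, on a small grid of values of $d$ and $\varepsilon$. It is only a sanity check on finitely many points, not a substitute for the analytic verification, and the function names and the grid are assumptions of this illustration.

```python
import math


def theorem3_n(d, eps):
    """n = ceil(2575.5 d (1 + eps) / eps^2), as chosen in the proof."""
    return math.ceil(2575.5 * d * (1 + eps) / eps**2)


def term_9(d, eps, eta=1 / 14, x=math.log(2) + 1e-10):
    """Threshold of Theorem 1 divided by n, for n = theorem3_n(d, eps)."""
    n = theorem3_n(d, eps)
    main = 1015 * (1 + eta) * d * (1 + math.sqrt(1 + 2 * n / (1015 * d)))
    return (main + 2 * math.sqrt(n * x) + (2.5 + 32 / eta) * x) / n


def check_grid(ds=(1, 2, 10, 100), eps_values=(0.01, 0.1, 0.5, 0.9)):
    """Spot check: term_9 <= eps, 2 exp(-x) < 1, and n <= 2576 d (1 + eps) / eps^2."""
    x = math.log(2) + 1e-10
    ok = 2 * math.exp(-x) < 1
    for d in ds:
        for eps in eps_values:
            ok = ok and term_9(d, eps) <= eps
            ok = ok and theorem3_n(d, eps) <= 2576 * d * (1 + eps) / eps**2
    return ok
```

The margins are small: for $d = 1$ and $\varepsilon = 0.5$ one gets $n = 15453$ and a value of about $0.4986$, which indicates that the constant $2575.5$ is close to the best one obtainable with this choice of $\eta$.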